We are using the same dataset from our first project. The dataset contains 9578 observations and 14 variables and is based on loans from LendingClub, a peer-to-peer lending company, that were made from 2007 to 2015. In summary, our EDA found:
1. credit.policy has notable relationships with other numeric variables such as int.rate, fico, and inq.last.6mths.
2. There were no clearly established relationships between not.fully.paid and the other numeric variables (except for credit.policy).
3. purpose has relationships with credit.policy and not.fully.paid.
4. For delinq.2yrs, the mean varies significantly across the categories of purpose.
We obtained the dataset from Kaggle here: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis
Our work is stored on our team GitHub here: https://github.com/jschild01/JMB_DATS_6101
From our EDA in project 1 we know that the credit.policy
and not.fully.paid variables function as logicals, and the
purpose variable functions as a factor with 7 levels. We
will start by formally converting these variables from numeric
data types in R to the logical and factor data types, respectively.
# We determined this makes sense to do from our EDA
loans$credit.policy <- as.logical(loans$credit.policy)
loans$not.fully.paid <- as.logical(loans$not.fully.paid)
loans$purpose <- as.factor(loans$purpose)
Of the 14 variables in the dataset, 11 are numeric, 2 are logical, and 1 is a factor.
From our EDA we saw that the overwhelming majority of loans had a
value of 0 for delinq.2yrs and pub.rec. The
variables might be more useful in terms of having stronger correlations
and a larger impact in models if we create logical versions of these
variables. We will introduce has.delinq.2yrs and
has.pub.rec based on whether or not the value is 0 or
greater than 0.
# It might help to turn these variables into logicals since most are 0
loans <- loans %>%
mutate(has.delinq.2yrs = delinq.2yrs > 0,
has.pub.rec = pub.rec > 0)
From our EDA we saw that the revol.bal variable had a
wide range of values as well as outlier issues. We wondered if it would
work better if we took the natural log of it, similar to how the income
provided in the dataset (the log.annual.inc variable) is in
the form of the natural log. However, some values for
revol.bal are 0, and taking the natural log of that results
in -Inf. One way to deal with this without removing those values is to
add 1 to all of the values for revol.bal before taking the
natural log.
Given the range of these values (the IQR is 1.50625 × 10^4), adding one
should have a negligible effect, and adding one to all values keeps the
data consistent. In case it is significant that the
revol.bal is 0 we will add a logical variable
has.revol.bal based on whether the value of
revol.bal is 0 or greater than 0.
# Some loans have a revol.bal value of 0, the natural log of which is -Inf
# Adding 1 to all values before taking the natural log resolves this issue by nudging the value a very small amount
# In case it was meaningful if the loan had a revol.bal value of 0 versus >0, we can add a logical to account for that
loans <- loans %>%
mutate(log.revol.bal = log(revol.bal+1),
has.revol.bal = revol.bal > 0)
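As a side note, base R provides `log1p()`, which computes log(1 + x) directly and is numerically more stable than `log(x + 1)` for values near zero; the transformation above could equivalently be written with it. A minimal check:

```r
# log1p(x) computes log(1 + x); for our purposes it is interchangeable
# with log(x + 1), and it keeps zero balances finite
x <- c(0, 1, 100, 10000)
all.equal(log1p(x), log(x + 1))  # TRUE
log1p(0)                         # 0, so loans with revol.bal of 0 stay finite
```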
Now that we have added some new variables, it would be helpful to look at a correlation plot and make some comparisons to assess whether they would be more useful going forward than their original forms.
# A new correlation matrix with the new and transformed variables
# For our correlation matrix we want to include everything but the purpose variable
# We can put the new variables together at the top
loans_correlation_matrix_new <- loans %>%
select(-purpose) %>%
select(revol.bal, log.revol.bal, has.revol.bal, has.delinq.2yrs, has.pub.rec, everything()) %>%
cor()
# The mixed correlation plot makes a nice visualization
corrplot.mixed(loans_correlation_matrix_new, tl.pos = "lt")
Some notable observations of this:
1. has.delinq.2yrs doesn’t have any strong correlations, and the 2 potentially useful ones (int.rate and fico) are almost exactly the same as delinq.2yrs.
2. Similarly, has.pub.rec doesn’t have any strong correlations, and the 2 potentially useful ones (int.rate and fico) are the same as pub.rec.
3. log.revol.bal has a stronger correlation with dti and a much stronger correlation with revol.util compared to revol.bal.
4. log.revol.bal has a weaker correlation with credit.policy compared to revol.bal.
5. has.revol.bal does not have a significant correlation with anything (log.revol.bal excepted) except revol.util, where the correlation is weaker than that of log.revol.bal.
From this we can conclude that converting delinq.2yrs
and pub.rec to logicals did not add any advantage for our
dataset. We can also conclude that the has.revol.bal
variable is not necessary and will not add value to our analysis beyond
what log.revol.bal already covers. Successfully eliminating
these as possibilities increases our confidence in the use of their
original structures.
Lastly, we see that taking the natural log of revol.bal
might be useful, as log.revol.bal has some stronger
correlations than revol.bal. We will note this as we
proceed with our modeling.
These are updated histograms, boxplots, and Q-Q plots with the
log.revol.bal added variable. We will use these to briefly
review the EDA we did for the first project.
Since one of our variables of interest, credit.policy, is
a logical, we can treat it as a 2-level factor and build a
classification tree model using the variables it has significant
correlations with, in this case int.rate, fico,
and inq.last.6mths.
For this variable we are interested in the loans where the value is
FALSE. That is, loans that do not meet the credit
underwriting criteria, and therefore are more at risk of defaulting.
These loans are in the minority. We will rename the FALSE
values to Fail, indicating that the loan fails to meet the
underwriting criteria, and rename the TRUE values to
Meets, indicating that the loan meets the underwriting
criteria.
# Since we are renaming some values let's make a separate dataset for this model
loans_tree <- loans
# First convert the credit.policy variable from a logical to a factor, then rename the levels
loans_tree$credit.policy <- as.factor(loans_tree$credit.policy) %>%
fct_recode(Meets = "TRUE", Fails = "FALSE")
# Max depth of 4 based on Professor Faruqe's instructions
#creditfit <- rpart(credit.policy ~ int.rate + fico + purpose + installment + log.annual.inc
#+ dti + inq.last.6mths, data=loans_tree, cp=.00001, method="class", control = list(maxdepth = 4) )
creditfit <- rpart(credit.policy ~ purpose + int.rate + installment + log.annual.inc + dti + fico + revol.util + inq.last.6mths + delinq.2yrs + pub.rec + not.fully.paid + has.delinq.2yrs + has.pub.rec, data = loans_tree, cp=.01, method="class", control=list(maxdepth=4))
printcp(creditfit) # display the results
##
## Classification tree:
## rpart(formula = credit.policy ~ purpose + int.rate + installment +
## log.annual.inc + dti + fico + revol.util + inq.last.6mths +
## delinq.2yrs + pub.rec + not.fully.paid + has.delinq.2yrs +
## has.pub.rec, data = loans_tree, method = "class", control = list(maxdepth = 4),
## cp = 0.01)
##
## Variables actually used in tree construction:
## [1] dti fico inq.last.6mths
##
## Root node error: 1868/9578 = 0.19503
##
## n= 9578
##
## CP nsplit rel error xerror xstd
## 1 0.484475 0 1.00000 1.00000 0.020759
## 2 0.187366 1 0.51552 0.51552 0.015755
## 3 0.076017 2 0.32816 0.32816 0.012823
## 4 0.042827 3 0.25214 0.25214 0.011329
## 5 0.010000 4 0.20931 0.20985 0.010380
plotcp(creditfit) # visualize cross-validation results
summary(creditfit) # detailed summary of splits
## Call:
## rpart(formula = credit.policy ~ purpose + int.rate + installment +
## log.annual.inc + dti + fico + revol.util + inq.last.6mths +
## delinq.2yrs + pub.rec + not.fully.paid + has.delinq.2yrs +
## has.pub.rec, data = loans_tree, method = "class", control = list(maxdepth = 4),
## cp = 0.01)
## n= 9578
##
## CP nsplit rel error xerror xstd
## 1 0.48447537 0 1.0000000 1.0000000 0.02075876
## 2 0.18736617 1 0.5155246 0.5155246 0.01575529
## 3 0.07601713 2 0.3281585 0.3281585 0.01282304
## 4 0.04282655 3 0.2521413 0.2521413 0.01132880
## 5 0.01000000 4 0.2093148 0.2098501 0.01037987
##
## Variable importance
## inq.last.6mths fico dti int.rate
## 55 36 6 2
##
## Node number 1: 9578 observations, complexity param=0.4844754
## predicted class=Meets expected loss=0.1950303 P(node) =1
## class counts: 1868 7710
## probabilities: 0.195 0.805
## left son=2 (1231 obs) right son=3 (8347 obs)
## Primary splits:
## inq.last.6mths < 3.5 to the right, improve=1277.88200, (0 missing)
## fico < 659.5 to the left, improve= 661.04680, (0 missing)
## int.rate < 0.1323 to the right, improve= 225.24690, (0 missing)
## dti < 24.995 to the right, improve= 191.29530, (0 missing)
## not.fully.paid < 0.5 to the right, improve= 75.18918, (0 missing)
## Surrogate splits:
## int.rate < 0.19965 to the right, agree=0.872, adj=0.006, (0 split)
## installment < 920.22 to the right, agree=0.872, adj=0.002, (0 split)
## fico < 624.5 to the left, agree=0.872, adj=0.002, (0 split)
## revol.util < 106.45 to the right, agree=0.872, adj=0.001, (0 split)
## delinq.2yrs < 7.5 to the right, agree=0.872, adj=0.001, (0 split)
##
## Node number 2: 1231 observations, complexity param=0.07601713
## predicted class=Fails expected loss=0.1324127 P(node) =0.1285237
## class counts: 1068 163
## probabilities: 0.868 0.132
## left son=4 (1047 obs) right son=5 (184 obs)
## Primary splits:
## fico < 739.5 to the left, improve=245.626900, (0 missing)
## int.rate < 0.12215 to the right, improve= 41.259790, (0 missing)
## revol.util < 18.85 to the right, improve= 29.324690, (0 missing)
## installment < 563.175 to the left, improve= 6.872785, (0 missing)
## inq.last.6mths < 7.5 to the right, improve= 6.651680, (0 missing)
## Surrogate splits:
## int.rate < 0.08975 to the right, agree=0.879, adj=0.19, (0 split)
##
## Node number 3: 8347 observations, complexity param=0.1873662
## predicted class=Meets expected loss=0.09584282 P(node) =0.8714763
## class counts: 800 7547
## probabilities: 0.096 0.904
## left son=6 (354 obs) right son=7 (7993 obs)
## Primary splits:
## fico < 659.5 to the left, improve=596.89400, (0 missing)
## dti < 24.995 to the right, improve=171.23690, (0 missing)
## int.rate < 0.1323 to the right, improve= 68.25480, (0 missing)
## revol.util < 99.95 to the right, improve= 34.42161, (0 missing)
## log.annual.inc < 10.03324 to the left, improve= 22.63182, (0 missing)
## Surrogate splits:
## revol.util < 104.75 to the right, agree=0.958, adj=0.011, (0 split)
## delinq.2yrs < 5.5 to the right, agree=0.958, adj=0.006, (0 split)
##
## Node number 4: 1047 observations
## predicted class=Fails expected loss=0 P(node) =0.109313
## class counts: 1047 0
## probabilities: 1.000 0.000
##
## Node number 5: 184 observations
## predicted class=Meets expected loss=0.1141304 P(node) =0.01921069
## class counts: 21 163
## probabilities: 0.114 0.886
##
## Node number 6: 354 observations
## predicted class=Fails expected loss=0.005649718 P(node) =0.0369597
## class counts: 352 2
## probabilities: 0.994 0.006
##
## Node number 7: 7993 observations, complexity param=0.04282655
## predicted class=Meets expected loss=0.05604904 P(node) =0.8345166
## class counts: 448 7545
## probabilities: 0.056 0.944
## left son=14 (88 obs) right son=15 (7905 obs)
## Primary splits:
## dti < 24.995 to the right, improve=143.665800, (0 missing)
## revol.util < 99.95 to the right, improve= 23.204870, (0 missing)
## log.annual.inc < 10.03324 to the left, improve= 15.555710, (0 missing)
## installment < 30.42 to the left, improve= 7.433320, (0 missing)
## fico < 739.5 to the left, improve= 6.749753, (0 missing)
##
## Node number 14: 88 observations
## predicted class=Fails expected loss=0.04545455 P(node) =0.009187722
## class counts: 84 4
## probabilities: 0.955 0.045
##
## Node number 15: 7905 observations
## predicted class=Meets expected loss=0.04604681 P(node) =0.8253289
## class counts: 364 7541
## probabilities: 0.046 0.954
Building the classification tree, we can see that fico,
inq.last.6mths, and dti were used in the tree
construction and the tree was split 4 times.
We can better visualize the tree by plotting it out. We can see the
inq.last.6mths threshold of greater than or equal to four,
as well as the two thresholds of 740 and 660 for FICO
scores. It details that of our 9578 loan sample, there are 1868 loans
that failed to meet the credit.policy and 7710 that
did.
# plot the tree and add text
rpart.plot(creditfit, uniform=TRUE, main="Classification Tree for Credit.Policy", digits = 3, extra = 1)
#plot(creditfit, uniform=TRUE, main="Classification Tree for credit.policy")
#text(creditfit, use.n=TRUE, all=TRUE, cex=.75)
Of all of the loans, 1231 had four or more
inq.last.6mths. Of those 1231:
- 1047 had a FICO score lower than 740 and failed to meet the credit.policy
- 21 had a FICO score of 740 or higher and failed to meet the credit.policy
- 163 had a FICO score of 740 or higher and met the credit.policy
This indicates that applicants will likely need a FICO
score of 740 or higher if they have four or more
inq.last.6mths in order to anticipate meeting the
credit.policy.
Conversely, of all of the loans, 8347 had fewer than four
inq.last.6mths. Of those 8347:
- 352 had a FICO score lower than 660 and failed to meet the credit.policy
- 2 had a FICO score lower than 660 but still managed to meet the credit.policy
- 448 had a FICO score of 660 or higher and failed to meet the credit.policy
- 7545 had a FICO score of 660 or higher and met the credit.policy
This indicates that applicants will likely need a FICO
score of only 660 or higher if they have fewer than four
inq.last.6mths in order to anticipate meeting the
credit.policy. We do not consider the threshold for
dti to be that meaningful because the value of 25 is very
high and not representative of most of the loan applicants.
Overall, we can see that fewer inq.last.6mths aligns with
lower FICO score requirements for the
credit.policy. Generally, if applicants have four or
more inq.last.6mths, then they will likely need a
FICO score of 740 or higher to meet the
credit.policy. On the other hand, if applicants have fewer
than four inq.last.6mths, then they will likely only need a
FICO score of 660 or higher to meet the
credit.policy. If applicants have a FICO score
lower than 660, then they are unlikely to meet the
credit.policy, regardless.
To better understand how our model performs we can construct a
confusion matrix of the results. The Fails values will
function as the positive cases for the confusion matrix we put together
based on the results of the classification tree model.
# Creating the confusion matrix
confusion_matrix = confusionMatrix(predict(creditfit, type = "class"), reference = loans_tree$credit.policy)
print('Overall: ')
## [1] "Overall: "
confusion_matrix$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 9.591773e-01 8.591600e-01 9.550201e-01 9.630522e-01 8.049697e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 1.848233e-81
print('Class: ')
## [1] "Class: "
confusion_matrix$byClass
## Sensitivity Specificity Pos Pred Value
## 0.7938972 0.9992218 0.9959704
## Neg Pred Value Precision Recall
## 0.9524045 0.9959704 0.7938972
## F1 Prevalence Detection Rate
## 0.8835270 0.1950303 0.1548340
## Detection Prevalence Balanced Accuracy
## 0.1554604 0.8965595
# The Kappa statistic is a metric that compares an observed accuracy with an expected accuracy (random chance). It is used not only to evaluate a single classifier, but also to compare classifiers amongst themselves. According to the common scheme, a value < 0 indicates no agreement, 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1 almost perfect agreement.
Let’s take a closer look and put together tables to visualize the confusion matrix as well as some statistics on it.
# Creating the confusion matrix table
# To get a table that follows the same format as our notes, with the rows representing true values and columns representing predicted values, we will have to adjust the data a bit.
# Start by converting to a data.frame
cm_table <- as.data.frame(confusion_matrix$table) %>%
rename(Actual = "Reference")
# Make it so that the rows represent true values and columns represent predicted values
cm_table$Prediction <- paste0("Prediction - ", cm_table$Prediction)
cm_table <- spread(cm_table, Prediction, Freq)
# Output a nice table
# xkabledply(cm_table, title = "Confusion matrix for the tree model")
cm_table %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| Actual | Prediction - Fails | Prediction - Meets |
|---|---|---|
| Fails | 1483 | 385 |
| Meets | 6 | 7704 |
# Creating a table for the confusion matrix statistics
# Start by converting to a data.frame
cm_stats <- as.data.frame(confusion_matrix$byClass) %>%
rownames_to_column()
# Adjust the column names and values to look better
colnames(cm_stats) <- c("Statistic", "Percentage")
cm_stats$Percentage <- paste0(round(cm_stats$Percentage*100, digits=1), "%")
# Output a nice table
# xkabledply(cm_stats, title = "Confusion matrix statistics for the tree model")
cm_stats %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| Statistic | Percentage |
|---|---|
| Sensitivity | 79.4% |
| Specificity | 99.9% |
| Pos Pred Value | 99.6% |
| Neg Pred Value | 95.2% |
| Precision | 99.6% |
| Recall | 79.4% |
| F1 | 88.4% |
| Prevalence | 19.5% |
| Detection Rate | 15.5% |
| Detection Prevalence | 15.5% |
| Balanced Accuracy | 89.7% |
Based on the Kappa statistic of 0.859 we would evaluate this classifier as having almost perfect agreement between the predictions and the actual values.
The sensitivity and specificity values indicate the model can fairly reliably detect loans that both meet and fail to meet the credit policy; looking at these values along with the AUC, we can also tell that we have a favorable, high-arching ROC curve.
The high precision indicates the tree is very good at identifying true positives, in that 99.6% of its predicted Fails were correct.
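As a quick sanity check, the Kappa value can be reproduced by hand from the confusion matrix counts above, comparing the observed accuracy to the agreement expected from the marginal class frequencies alone:

```r
# Confusion matrix counts from the tree model (rows = actual, cols = predicted)
cm <- matrix(c(1483,  385,
                  6, 7704),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("Fails", "Meets"),
                             Predicted = c("Fails", "Meets")))
n <- sum(cm)
observed <- sum(diag(cm)) / n                     # observed accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # chance agreement
kappa <- (observed - expected) / (1 - expected)
round(kappa, 3)  # 0.859, matching confusion_matrix$overall
```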
Another way that we can evaluate the performance of the model is to look at the ROC curve and the area under the curve (AUC).
# loans_tree$prediction <- predict(creditfit, type = "vector")
loans_tree$prediction <- predict(creditfit, type = "prob")[,2]
tree_roc <- roc(credit.policy ~ prediction, data=loans_tree)
auc(tree_roc) # area-under-curve prefer 0.8 or higher.
plot(tree_roc)
The AUC of the ROC is 0.9002586, which is greater than the 0.8 threshold, so we would consider this to be a good indicator/model. We can conclude that this is overall an effective model for determining if a loan will meet the credit underwriting policy set by LendingClub.
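For intuition on what the AUC measures, it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case (the Wilcoxon/Mann-Whitney statistic, with ties counted as half). A minimal sketch on toy scores, not a replacement for pROC's implementation:

```r
# AUC as concordance: fraction of (positive, negative) pairs ranked correctly,
# with tied scores counted as half a correct pair
auc_concordance <- function(scores_pos, scores_neg) {
  diffs <- outer(scores_pos, scores_neg, FUN = "-")
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

# A perfect ranker scores every positive above every negative
auc_concordance(c(0.8, 0.9), c(0.1, 0.2))  # 1
# Identically distributed scores give the random-classifier value
auc_concordance(c(0.3, 0.7), c(0.3, 0.7))  # 0.5
```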
Similar to what we did for credit.policy, we also treated
not.fully.paid as a 2-level factor for
the tree model.
In this case, we are interested in the loans where the value is
TRUE. That is, loans that were not fully paid, which are in
the minority. We will rename the TRUE values to
Unpaid, indicating they failed to be fully paid, and rename
the FALSE values to Paid, indicating they were
paid.
Our initial classification tree did not choose any variables and had
only a single trunk node, a kappa value of 0, and a ROC curve that fit
directly on top of the random classifier line. We interpreted this as a
strong indicator that a classification tree would not be useful for
not.fully.paid.
To confirm, we overrode the default complexity parameter in the rpart
function and set it from .01 to .001. This led to the classification
tree using the variables credit.policy,
inq.last.6mths, int.rate, and
purpose.
# Since we are renaming some values let's make a separate dataset for this model
loans_tree_nfp <- loans
# First convert the not.fully.paid variable from a logical to a factor, then rename the levels
loans_tree_nfp$not.fully.paid <- as.factor(loans_tree_nfp$not.fully.paid) %>%
fct_recode(Unpaid = "TRUE", Paid = "FALSE")
# Max depth of 4 based on Professor Faruqe's instructions
creditfit_nfp <- rpart(not.fully.paid ~ purpose + int.rate + installment + log.annual.inc + dti + fico + days.with.cr.line + revol.bal + revol.util + inq.last.6mths + delinq.2yrs + pub.rec + credit.policy + has.delinq.2yrs + has.pub.rec + log.revol.bal + has.revol.bal, data = loans_tree_nfp, cp=.001, method="class", control=list(maxdepth=4))
printcp(creditfit_nfp) # display the results
##
## Classification tree:
## rpart(formula = not.fully.paid ~ purpose + int.rate + installment +
## log.annual.inc + dti + fico + days.with.cr.line + revol.bal +
## revol.util + inq.last.6mths + delinq.2yrs + pub.rec + credit.policy +
## has.delinq.2yrs + has.pub.rec + log.revol.bal + has.revol.bal,
## data = loans_tree_nfp, method = "class", control = list(maxdepth = 4),
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] credit.policy inq.last.6mths int.rate purpose
##
## Root node error: 1533/9578 = 0.16005
##
## n= 9578
##
## CP nsplit rel error xerror xstd
## 1 0.0014677 0 1.00000 1.0000 0.023407
## 2 0.0010000 4 0.99413 1.0072 0.023475
plotcp(creditfit_nfp) # visualize cross-validation results
summary(creditfit_nfp) # detailed summary of splits
## Call:
## rpart(formula = not.fully.paid ~ purpose + int.rate + installment +
## log.annual.inc + dti + fico + days.with.cr.line + revol.bal +
## revol.util + inq.last.6mths + delinq.2yrs + pub.rec + credit.policy +
## has.delinq.2yrs + has.pub.rec + log.revol.bal + has.revol.bal,
## data = loans_tree_nfp, method = "class", control = list(maxdepth = 4),
## cp = 0.001)
## n= 9578
##
## CP nsplit rel error xerror xstd
## 1 0.00146771 0 1.0000000 1.000000 0.02340747
## 2 0.00100000 4 0.9941292 1.007175 0.02347524
##
## Variable importance
## credit.policy inq.last.6mths fico days.with.cr.line
## 42 26 11 6
## purpose dti int.rate revol.bal
## 5 3 3 3
## installment
## 1
##
## Node number 1: 9578 observations, complexity param=0.00146771
## predicted class=Paid expected loss=0.1600543 P(node) =1
## class counts: 8045 1533
## probabilities: 0.840 0.160
## left son=2 (7710 obs) right son=3 (1868 obs)
## Primary splits:
## credit.policy < 0.5 to the right, improve=64.38613, (0 missing)
## inq.last.6mths < 2.5 to the left, improve=49.21306, (0 missing)
## int.rate < 0.09325 to the left, improve=43.91818, (0 missing)
## fico < 709.5 to the right, improve=37.66786, (0 missing)
## purpose splits as LLLLLLR, improve=18.37063, (0 missing)
## Surrogate splits:
## inq.last.6mths < 3.5 to the left, agree=0.899, adj=0.484, (0 split)
## fico < 659.5 to the right, agree=0.856, adj=0.260, (0 split)
## days.with.cr.line < 1109.979 to the right, agree=0.833, adj=0.143, (0 split)
## dti < 24.995 to the left, agree=0.820, adj=0.079, (0 split)
## revol.bal < 105421 to the left, agree=0.818, adj=0.066, (0 split)
##
## Node number 2: 7710 observations
## predicted class=Paid expected loss=0.1315175 P(node) =0.8049697
## class counts: 6696 1014
## probabilities: 0.868 0.132
##
## Node number 3: 1868 observations, complexity param=0.00146771
## predicted class=Paid expected loss=0.2778373 P(node) =0.1950303
## class counts: 1349 519
## probabilities: 0.722 0.278
## left son=6 (1433 obs) right son=7 (435 obs)
## Primary splits:
## inq.last.6mths < 5.5 to the left, improve=8.267491, (0 missing)
## purpose splits as LLLLRLR, improve=8.035853, (0 missing)
## installment < 688.415 to the left, improve=5.664644, (0 missing)
## int.rate < 0.09355 to the left, improve=5.396424, (0 missing)
## fico < 739.5 to the right, improve=3.761452, (0 missing)
## Surrogate splits:
## int.rate < 0.18995 to the left, agree=0.774, adj=0.028, (0 split)
## revol.bal < 777266 to the left, agree=0.768, adj=0.005, (0 split)
## log.revol.bal < 13.53761 to the left, agree=0.768, adj=0.005, (0 split)
## revol.util < 106.45 to the left, agree=0.768, adj=0.002, (0 split)
## delinq.2yrs < 7.5 to the left, agree=0.768, adj=0.002, (0 split)
##
## Node number 6: 1433 observations, complexity param=0.00146771
## predicted class=Paid expected loss=0.2519191 P(node) =0.1496137
## class counts: 1072 361
## probabilities: 0.748 0.252
## left son=12 (1266 obs) right son=13 (167 obs)
## Primary splits:
## purpose splits as LLLLRLR, improve=7.762363, (0 missing)
## installment < 688.415 to the left, improve=5.816501, (0 missing)
## int.rate < 0.16115 to the left, improve=5.635554, (0 missing)
## log.annual.inc < 11.91168 to the left, improve=3.772865, (0 missing)
## fico < 724.5 to the right, improve=3.409773, (0 missing)
## Surrogate splits:
## installment < 842.315 to the left, agree=0.888, adj=0.042, (0 split)
## int.rate < 0.18995 to the left, agree=0.886, adj=0.024, (0 split)
## log.annual.inc < 12.79886 to the left, agree=0.885, adj=0.012, (0 split)
## fico < 804.5 to the left, agree=0.884, adj=0.006, (0 split)
##
## Node number 7: 435 observations
## predicted class=Paid expected loss=0.3632184 P(node) =0.04541658
## class counts: 277 158
## probabilities: 0.637 0.363
##
## Node number 12: 1266 observations
## predicted class=Paid expected loss=0.2330174 P(node) =0.1321779
## class counts: 971 295
## probabilities: 0.767 0.233
##
## Node number 13: 167 observations, complexity param=0.00146771
## predicted class=Paid expected loss=0.3952096 P(node) =0.01743579
## class counts: 101 66
## probabilities: 0.605 0.395
## left son=26 (136 obs) right son=27 (31 obs)
## Primary splits:
## int.rate < 0.17125 to the left, improve=4.756434, (0 missing)
## installment < 183.925 to the left, improve=3.673478, (0 missing)
## dti < 0.935 to the left, improve=2.971576, (0 missing)
## revol.bal < 210900 to the right, improve=2.583991, (0 missing)
## log.revol.bal < 12.25914 to the right, improve=2.583991, (0 missing)
## Surrogate splits:
## installment < 882.23 to the left, agree=0.850, adj=0.194, (0 split)
## days.with.cr.line < 11819.98 to the left, agree=0.826, adj=0.065, (0 split)
## dti < 25.175 to the left, agree=0.820, adj=0.032, (0 split)
##
## Node number 26: 136 observations
## predicted class=Paid expected loss=0.3382353 P(node) =0.01419921
## class counts: 90 46
## probabilities: 0.662 0.338
##
## Node number 27: 31 observations
## predicted class=Unpaid expected loss=0.3548387 P(node) =0.003236584
## class counts: 11 20
## probabilities: 0.355 0.645
We can better visualize this tree by plotting it out.
# plot the tree and add text
rpart.plot(creditfit_nfp, uniform=TRUE, main="Classification Tree for Not Fully Paid", digits = 3, extra = 1)
#plot(creditfit, uniform=TRUE, main="Classification Tree for credit.policy")
#text(creditfit, use.n=TRUE, all=TRUE, cex=.75)
We can see that the tree is not very useful. The leaf nodes do not
come close to zeroing out and are not well separated. Overall, we can
only infer from the tree that most loans that met the credit policy
ended up being paid, and that a high number of
inq.last.6mths and/or a high int.rate are
yellow flags that point to a loan possibly not being fully paid, all of
which is what we would expect for any general loan dataset.
We can speculate that loans used for home_improvement
and small_business with high interest rates and a high
number of inq.last.6mths have a higher likelihood of not
getting paid, but this too would fit generic expectations.
To better understand how our model performs we can construct a
confusion matrix of the results. The Unpaid values will
function as the positive cases for the confusion matrix we put together
based on the results of the classification tree model.
# Creating the confusion matrix
confusion_matrix_nfp = confusionMatrix(predict(creditfit_nfp, type = "class"), reference = loans_tree_nfp$not.fully.paid)
print('Overall: ')
## [1] "Overall: "
confusion_matrix_nfp$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.408854e-01 1.935338e-02 8.334043e-01 8.481585e-01 8.399457e-01
## AccuracyPValue McnemarPValue
## 4.075345e-01 1.976263e-323
print('Class: ')
## [1] "Class: "
confusion_matrix_nfp$byClass
## Sensitivity Specificity Pos Pred Value
## 0.99863269 0.01304631 0.84152090
## Neg Pred Value Precision Recall
## 0.64516129 0.84152090 0.99863269
## F1 Prevalence Detection Rate
## 0.91336971 0.83994571 0.83879724
## Detection Prevalence Balanced Accuracy
## 0.99676342 0.50583950
Let’s put together tables to visualize the confusion matrix as well as some statistics on it.
# Creating the confusion matrix table
# To get a table that follows the same format as our notes, with the rows representing true values and columns representing predicted values, we will have to adjust the data a bit.
# Start by converting to a data.frame
cm_table_nfp <- as.data.frame(confusion_matrix_nfp$table) %>%
rename(Actual = "Reference")
# Make it so that the rows represent true values and columns represent predicted values
cm_table_nfp$Prediction <- paste0("Prediction - ", cm_table_nfp$Prediction)
cm_table_nfp <- spread(cm_table_nfp, Prediction, Freq)
# Output a nice table
# xkabledply(cm_table, title = "Confusion matrix for the tree model")
cm_table_nfp %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| Actual | Prediction - Paid | Prediction - Unpaid |
|---|---|---|
| Paid | 8034 | 11 |
| Unpaid | 1513 | 20 |
#Creating a table for the confusion matrix statistics
# Start by converting to a data.frame
cm_stats_nfp <- as.data.frame(confusion_matrix_nfp$byClass) %>%
rownames_to_column()
# Adjust the column names and values to look better
colnames(cm_stats_nfp) <- c("Statistic", "Percentage")
cm_stats_nfp$Percentage <- paste0(round(cm_stats_nfp$Percentage*100, digits=1), "%")
# Output a nice table
# xkabledply(cm_stats, title = "Confusion matrix statistics for the tree model")
cm_stats_nfp %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| Statistic | Percentage |
|---|---|
| Sensitivity | 99.9% |
| Specificity | 1.3% |
| Pos Pred Value | 84.2% |
| Neg Pred Value | 64.5% |
| Precision | 84.2% |
| Recall | 99.9% |
| F1 | 91.3% |
| Prevalence | 84% |
| Detection Rate | 83.9% |
| Detection Prevalence | 99.7% |
| Balanced Accuracy | 50.6% |
We immediately see the extremely low specificity value, which indicates the tree was terrible at detecting loans that went unpaid. Combined with the low AUC, this indicates a less than promising ROC curve.
Based on the Kappa statistic of 0.02 we would evaluate this
classifier as only slight agreement, which confirms our initial
suspicion that the classification tree for not.fully.paid
would not be useful.
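The weakness shows up directly in the counts: because roughly 84% of loans were paid, a model that always predicts Paid achieves nearly the same accuracy as the tree. Recomputing from the confusion matrix counts above:

```r
# Confusion matrix counts for the not.fully.paid tree (rows = actual)
cm_nfp <- matrix(c(8034,   11,
                   1513,   20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Actual = c("Paid", "Unpaid"),
                                 Predicted = c("Paid", "Unpaid")))
n <- sum(cm_nfp)
accuracy      <- sum(diag(cm_nfp)) / n     # ~0.841: looks respectable...
null_accuracy <- max(rowSums(cm_nfp)) / n  # ~0.840: ...but always-Paid nearly matches it
specificity   <- cm_nfp["Unpaid", "Unpaid"] / sum(cm_nfp["Unpaid", ])  # ~0.013
```

The near-zero gap between accuracy and the no-information rate is exactly what the Kappa of 0.02 is flagging.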
Another way that we can evaluate the performance of the model is to look at the ROC curve and the area under the curve (AUC).
# loans_tree$prediction <- predict(creditfit, type = "vector")
loans_tree_nfp$prediction <- predict(creditfit_nfp, type = "prob")[,2]
tree_roc_nfp <- roc(not.fully.paid ~ prediction, data=loans_tree_nfp)
auc(tree_roc_nfp) # area-under-curve prefer 0.8 or higher.
plot(tree_roc_nfp)
The AUC of the ROC is 0.5899987, which is much less than the preferred 0.8 threshold, so we would consider this to be a poor model. We can conclude that this is overall a poor model for determining if a loan will default.
From the correlation plot we see 5 combinations of variables with notable correlations (greater than 0.4 or less than -0.4) that we would like to explore further by trying to model them with simple linear regressions.
# fit linear model
linear_model_1 <- lm(int.rate ~ fico, data=loans)
# view summary of linear model
summary(linear_model_1)
# scatterplot with regression line and confidence interval
scatterplot(int.rate ~ fico, data = loans)
#loans %>%
# ggplot(aes(x = fico, y = int.rate)) +
# geom_point(color = "steelblue", alpha = 0.2) +
# geom_smooth(method = "lm", se = T) +
# labs(title = "Interest Rate vs FICO Score",
# x = "FICO Score", y = "Interest Rate") +
# scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
# scale_y_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
# theme_minimal()
# fit linear model
linear_model_2 <- lm(revol.util ~ int.rate, data=loans)
# view summary of linear model
summary(linear_model_2)
# scatterplot with regression line and confidence interval
scatterplot(revol.util ~ int.rate, data = loans)
#loans %>%
# ggplot(aes(x = int.rate, y = revol.util)) +
# geom_point(color = "steelblue", alpha = 0.2) +
# geom_smooth(method = "lm") +
# labs(title = "Revolving Line Utilization Rate vs Interest Rate",
# x = "Interest Rate", y = "Revolving Line Utilization Rate") +
# scale_x_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
# scale_y_continuous(labels = label_percent(scale = 1)) +
# theme_minimal()
# fit linear model
linear_model_3 <- lm(installment ~ log.annual.inc, data=loans)
# view summary of linear model
summary(linear_model_3)
# scatterplot with regression line and confidence interval
scatterplot(installment ~ log.annual.inc, data = loans)
#loans %>%
# ggplot(aes(x = log.annual.inc, y = installment)) +
# geom_point(color = "steelblue", alpha = 0.2) +
# geom_smooth(method = "lm") +
# labs(title = "Installment vs Log of Annual Income",
# x = "Log of Annual Income", y = "Installment") +
# theme_minimal()
# fit linear model
linear_model_4 <- lm(revol.util ~ fico, data=loans)
# view summary of linear model
summary(linear_model_4)
# scatterplot with regression line and confidence interval
scatterplot(revol.util ~ fico, data = loans)
#loans %>%
# ggplot(aes(x = fico, y = revol.util)) +
# geom_point(color = "steelblue", alpha = 0.2) +
# geom_smooth(method = "lm") +
# labs(title = "Revolving Line Utilization Rate vs FICO Score",
# x = "FICO Score", y = "Revolving Line Utilization Rate") +
# scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
# scale_y_continuous(labels = label_percent(scale = 1)) +
# theme_minimal()
# fit linear model
linear_model_5 <- lm(revol.util ~ log.revol.bal, data=loans)
# view summary of linear model
summary(linear_model_5)
# scatterplot with regression line and confidence interval
scatterplot(revol.util ~ log.revol.bal, data = loans)
The first is interest rate vs. FICO score. The relationship is close to linear: the least-squares line lies close to the nonparametric smoother that scatterplot() overlays. This is supported by an adjusted R^2 of about 0.51, relatively high for these data, with a p-value less than 0.05.
The second is revolving utilization rate vs. interest rate. The relationship is less linear: the least-squares line only roughly tracks the smoother. The adjusted R^2 is relatively low at about 0.21, with a p-value less than 0.05.
The third is installment vs. natural log of annual income. The relationship is again not very linear: the least-squares line diverges from the smoother, though the two converge as both variables increase. The adjusted R^2 is relatively low at about 0.20, with a p-value less than 0.05.
The fourth is revolving utilization rate vs. FICO score. The least-squares line lies relatively closer to the smoother than in the previous two cases, but the adjusted R^2 is still low at about 0.29, with a p-value less than 0.05.
The fifth is revolving utilization rate vs. natural log of revolving balance. The smoother is far from linear, and the adjusted R^2 is relatively low at about 0.25.
Overall we can conclude that simple linear regression is not an effective way to model these relationships, though the first model isn’t too bad.
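Extracting the adjusted R^2 from each fit programmatically makes this comparison easy to reproduce. A minimal sketch on simulated stand-in data, since the loans dataset is not bundled here; with the real data, `df` would be `loans` and the formula list would hold the five formulas above:

```r
# Sketch: extracting adjusted R-squared from a list of simple regressions.
# Simulated stand-in data; replace df with loans and use the five formulas
# above to reproduce the comparison in the report.
set.seed(7)
df <- data.frame(fico = rnorm(1000, mean = 710, sd = 35))
df$int.rate   <- 0.35 - 0.00033 * df$fico + rnorm(1000, sd = 0.015)
df$revol.util <- 250  - 0.28    * df$fico + rnorm(1000, sd = 15)

formulas <- list(int_vs_fico  = int.rate ~ fico,
                 util_vs_fico = revol.util ~ fico)
fits <- lapply(formulas, lm, data = df)
# One adjusted R^2 per model, ready for a side-by-side comparison
sapply(fits, function(m) round(summary(m)$adj.r.squared, 2))
```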
To study the effects on credit.policy of the factor
purpose (credit.policy and
purpose are both categorical variables), we can create a
two-way contingency table of the outcome and predictor and make sure
there are no cells with zero frequency.
credit_policy_purpose_table <- xtabs(~ credit.policy + purpose, data = loans)
credit_policy_purpose_table %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| | all_other | credit_card | debt_consolidation | educational | home_improvement | major_purchase | small_business |
|---|---|---|---|---|---|---|---|
| FALSE | 496 | 242 | 734 | 89 | 117 | 66 | 124 |
| TRUE | 1835 | 1020 | 3223 | 254 | 512 | 371 | 495 |
set.seed(7)
loans_logit_credit <- loans
loans_logit_credit$credit.policy = as.factor(loans_logit_credit$credit.policy)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
logit_credit_training <- train(credit.policy~., data=loans_logit_credit, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
logit_credit_importance <- varImp(logit_credit_training, scale=FALSE)
# summarize importance
print(logit_credit_importance)
## glm variable importance
##
## only 20 most important variables shown (out of 22)
##
## Overall
## inq.last.6mths 36.8650
## revol.bal 22.8914
## fico 20.9002
## log.revol.bal 8.7245
## log.annual.inc 7.8240
## days.with.cr.line 5.2872
## has.revol.balTRUE 4.1136
## delinq.2yrs 4.0613
## has.delinq.2yrsTRUE 3.6786
## not.fully.paidTRUE 2.4499
## has.pub.recTRUE 2.2738
## pub.rec 2.1131
## int.rate 2.0640
## installment 1.9835
## purposesmall_business 1.5220
## dti 1.4768
## purposedebt_consolidation 1.3417
## purposemajor_purchase 1.2737
## purposehome_improvement 1.1735
## purposeeducational 0.7816
# plot importance
plot(logit_credit_importance)
logit_credit_model <- glm(credit.policy ~ inq.last.6mths+fico+log.annual.inc+dti+delinq.2yrs+pub.rec+purpose, data = loans_logit_credit, binomial(link = "logit") )
We can see the summary of the logit model here:
summary(logit_credit_model)
##
## Call:
## glm(formula = credit.policy ~ inq.last.6mths + fico + log.annual.inc +
## dti + delinq.2yrs + pub.rec + purpose, family = binomial(link = "logit"),
## data = loans_logit_credit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8772 0.0833 0.2528 0.4724 2.2754
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.291438 1.071709 -21.733 < 2e-16 ***
## inq.last.6mths -0.845198 0.022342 -37.830 < 2e-16 ***
## fico 0.035719 0.001337 26.723 < 2e-16 ***
## log.annual.inc 0.140579 0.057764 2.434 0.014946 *
## dti -0.020498 0.005289 -3.876 0.000106 ***
## delinq.2yrs -0.025545 0.055505 -0.460 0.645352
## pub.rec 0.279098 0.126540 2.206 0.027411 *
## purposecredit_card 0.026786 0.118791 0.225 0.821602
## purposedebt_consolidation 0.408098 0.089425 4.564 5.03e-06 ***
## purposeeducational -0.039397 0.184848 -0.213 0.831223
## purposehome_improvement 0.285835 0.160903 1.776 0.075660 .
## purposemajor_purchase 0.368912 0.192735 1.914 0.055609 .
## purposesmall_business 0.168896 0.155700 1.085 0.278033
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9452.2 on 9577 degrees of freedom
## Residual deviance: 5507.3 on 9565 degrees of freedom
## AIC: 5533.3
##
## Number of Fisher Scoring iterations: 6
# nicer table for the model summary
xkabledply(logit_credit_model, title = "Logistic Regression :")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -23.2914 | 1.0717 | -21.7330 | 0.0000 |
| inq.last.6mths | -0.8452 | 0.0223 | -37.8305 | 0.0000 |
| fico | 0.0357 | 0.0013 | 26.7228 | 0.0000 |
| log.annual.inc | 0.1406 | 0.0578 | 2.4337 | 0.0149 |
| dti | -0.0205 | 0.0053 | -3.8755 | 0.0001 |
| delinq.2yrs | -0.0255 | 0.0555 | -0.4602 | 0.6454 |
| pub.rec | 0.2791 | 0.1265 | 2.2056 | 0.0274 |
| purposecredit_card | 0.0268 | 0.1188 | 0.2255 | 0.8216 |
| purposedebt_consolidation | 0.4081 | 0.0894 | 4.5636 | 0.0000 |
| purposeeducational | -0.0394 | 0.1848 | -0.2131 | 0.8312 |
| purposehome_improvement | 0.2858 | 0.1609 | 1.7764 | 0.0757 |
| purposemajor_purchase | 0.3689 | 0.1927 | 1.9141 | 0.0556 |
| purposesmall_business | 0.1689 | 0.1557 | 1.0847 | 0.2780 |
Before moving on, let us look at the model object
logit_credit_model a little deeper. The fitted values can
be found in logit_credit_model$fitted.values, and the
first fitted value is 0.9904137. This is the predicted probability that
credit.policy is TRUE for data point #1. Compare this to the value from
predict(logit_credit_model), which is 4.6377906.
The predict() function gives us the logit value. You can
exponentiate it to get the odds p/q, which is 103.315827. And finally, we
can recover p from p/q, and indeed it is confirmed to be 0.9904137.
The easier way to get that is simply to use
predict(logit_credit_model, type = c("response"))[1] =
0.9904137. The predict() function also lets you obtain
model predictions for unseen/untrained data points, which
fitted.values cannot provide.
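The chain from logit to odds to probability described above can be checked by hand, using the logit value reported for data point #1:

```r
# Sketch: logit -> odds -> probability, using the logit value reported above.
logit_val <- 4.6377906          # predict(logit_credit_model)[1]
odds      <- exp(logit_val)     # p / (1 - p), about 103.32
p         <- odds / (1 + odds)  # same as plogis(logit_val)
round(p, 7)                     # about 0.9904137
```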
logit_credit_p_fitted <- logit_credit_model$fitted.values[1] # the model's predicted probability p-hat for the first data row (not the actual data point)
This is stored in the model as the fitted value for the probability
p of the first data point. Since the actual data point is a
0/1 True/False value, it is not easy to compare them directly unless we
use a cutoff value (0.5 by default) to convert the probability
p to 0/1.
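A hypothetical sketch of that cutoff step, turning fitted probabilities into TRUE/FALSE predictions (the probabilities here are made up for illustration):

```r
# Sketch: converting predicted probabilities to class labels with a cutoff.
probs  <- c(0.99, 0.41, 0.73, 0.08)  # stand-in fitted probabilities
labels <- probs > 0.5                # default cutoff of 0.5
labels                               # TRUE FALSE TRUE FALSE
```

With the real model, `probs` would be `logit_credit_model$fitted.values`, and a different cutoff could be chosen to trade sensitivity against specificity.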
Now, for unseen data points, we can use the predict()
function to find the model's predicted values. But be careful about what
you are getting from predict() in classification models: by
default it returns the linear predictor, as shown below. For easy
comparison, we use the same data point from the dataset as an example.
# This gives you the predicted values of the data points inside the model.
predict(logit_credit_model) # this is from the model; it gives the value of logit(p), i.e. ln(p/q)
We can determine the confidence interval of each coefficient in two slightly different ways:
## CIs using profiled log-likelihood
# confint(logit_credit_model)
xkabledply(confint(logit_credit_model), title = "CIs using profiled log-likelihood")
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -25.4097 | -21.2079 |
| inq.last.6mths | -0.8895 | -0.8020 |
| fico | 0.0331 | 0.0384 |
| log.annual.inc | 0.0275 | 0.2539 |
| dti | -0.0309 | -0.0101 |
| delinq.2yrs | -0.1330 | 0.0847 |
| pub.rec | 0.0355 | 0.5317 |
| purposecredit_card | -0.2052 | 0.2606 |
| purposedebt_consolidation | 0.2326 | 0.5832 |
| purposeeducational | -0.3970 | 0.3282 |
| purposehome_improvement | -0.0260 | 0.6051 |
| purposemajor_purchase | -0.0020 | 0.7543 |
| purposesmall_business | -0.1333 | 0.4773 |
## CIs using standard errors
# confint.default(logit_credit_model)
xkabledply(confint.default(logit_credit_model), title = "CIs using standard errors")
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -25.3919 | -21.1909 |
| inq.last.6mths | -0.8890 | -0.8014 |
| fico | 0.0331 | 0.0383 |
| log.annual.inc | 0.0274 | 0.2538 |
| dti | -0.0309 | -0.0101 |
| delinq.2yrs | -0.1343 | 0.0832 |
| pub.rec | 0.0311 | 0.5271 |
| purposecredit_card | -0.2060 | 0.2596 |
| purposedebt_consolidation | 0.2328 | 0.5834 |
| purposeeducational | -0.4017 | 0.3229 |
| purposehome_improvement | -0.0295 | 0.6012 |
| purposemajor_purchase | -0.0088 | 0.7467 |
| purposesmall_business | -0.1363 | 0.4741 |
xkabledply(confusion_matrix(logit_credit_model), title = "Confusion matrix from Logit Model" )
| Predicted FALSE | Predicted TRUE | Total | |
|---|---|---|---|
| Actual FALSE | 1099 | 769 | 1868 |
| Actual TRUE | 221 | 7489 | 7710 |
| Total | 1320 | 8258 | 9578 |
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the classification threshold is varied. The area under the curve (AUC) lies between 0.5 and 1; values above 0.8 are generally considered a good model fit.
# receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
loans_logit_credit$prediction <- predict(logit_credit_model, type = "response" )
roc_logit_credit <- roc(credit.policy ~ prediction, data=loans_logit_credit)
auc(roc_logit_credit) # area-under-curve prefer 0.8 or higher.
# visualize the curve
plot(roc_logit_credit)
We have an area under the curve of 0.8874, which is above the 0.8 threshold, so by the AUC criterion this is a good model.
McFadden is another evaluation tool we can use on logit regressions. This is part of what is called pseudo-R-squared values for evaluation tests. We can calculate the value directly from its definition if we so choose to.
logit_credit_null <- glm(credit.policy ~ 1, data = loans_logit_credit, family = "binomial")
mcFadden = 1 - logLik(logit_credit_model)/logLik(logit_credit_null)
mcFadden
## 'log Lik.' 0.41736 (df=13)
Or we can use other libraries. The pscl (Political
Science Computational Lab) library has the function pR2()
(pseudo-\(R^2\)) which will do the
trick.
logit_credit_pr2 = pR2(logit_credit_model)
## fitting null model for pseudo-r2
logit_credit_pr2
## llh llhNull G2 McFadden r2ML
## -2753.6284428 -4726.1228998 3944.9889140 0.4173600 0.3375964
## r2CU
## 0.5382092
The McFadden pseudo-\(R^2\) of 0.417 is loosely analogous to the coefficient of determination \(R^2\): about 42% of the null-model deviance is accounted for by the explanatory variables. For McFadden's measure, values in the 0.2 to 0.4 range are often already considered a good fit, so this is a respectable result.
The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.
logit_credit_hoslem <- hoslem.test(loans_logit_credit$credit.policy, fitted(logit_credit_model)) # Hosmer and Lemeshow test, a chi-squared test
logit_credit_hoslem
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: loans_logit_credit$credit.policy, fitted(logit_credit_model)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.
The p-value is essentially zero, so we reject the null hypothesis of good fit: by the Hosmer-Lemeshow criterion the model is poorly calibrated, even though its AUC indicates strong discrimination. (Note that the reported X-squared equals the sample size exactly, which suggests hoslem.test() may not be handling the logical outcome as intended; converting credit.policy to numeric 0/1 first would be safer.)
To study the effects on not.fully.paid by the factor
purpose (not.fully.paid and
purpose are both categorical variables), we can create
two-way contingency table of the outcome and predictors, and make sure
there are no cells of zero frequencies.
nfp_purpose_table = xtabs(~ not.fully.paid + purpose, data = loans)
nfp_purpose_table %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
| | all_other | credit_card | debt_consolidation | educational | home_improvement | major_purchase | small_business |
|---|---|---|---|---|---|---|---|
| FALSE | 1944 | 1116 | 3354 | 274 | 522 | 388 | 447 |
| TRUE | 387 | 146 | 603 | 69 | 107 | 49 | 172 |
set.seed(7)
loans_logit_nfp <- loans
loans_logit_nfp$not.fully.paid = as.factor(loans_logit_nfp$not.fully.paid)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
logit_nfp_training <- train(not.fully.paid~., data=loans_logit_nfp, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
logit_nfp_importance <- varImp(logit_nfp_training, scale=FALSE)
# summarize importance
print(logit_nfp_importance)
# plot importance
plot(logit_nfp_importance)
logit_nfp_model <- glm(not.fully.paid ~., data = loans_logit_nfp, binomial(link = "logit") )
We can see the summary of the logit model here:
summary(logit_nfp_model)
##
## Call:
## glm(formula = not.fully.paid ~ ., family = binomial(link = "logit"),
## data = loans_logit_nfp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3959 -0.6183 -0.4944 -0.3647 2.6202
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.548e+00 1.298e+00 6.587 4.50e-11 ***
## credit.policyTRUE -3.289e-01 8.473e-02 -3.881 0.000104 ***
## purposecredit_card -5.124e-01 1.099e-01 -4.662 3.13e-06 ***
## purposedebt_consolidation -3.336e-01 7.814e-02 -4.269 1.96e-05 ***
## purposeeducational 6.063e-02 1.522e-01 0.398 0.690368
## purposehome_improvement 9.907e-02 1.266e-01 0.783 0.433722
## purposemajor_purchase -3.369e-01 1.664e-01 -2.025 0.042886 *
## purposesmall_business 5.445e-01 1.173e-01 4.642 3.45e-06 ***
## int.rate 1.299e+00 1.745e+00 0.744 0.456604
## installment 1.228e-03 1.766e-04 6.955 3.52e-12 ***
## log.annual.inc -3.943e-01 6.086e-02 -6.479 9.23e-11 ***
## dti 8.232e-04 4.723e-03 0.174 0.861621
## fico -8.874e-03 1.428e-03 -6.216 5.09e-10 ***
## days.with.cr.line 1.139e-05 1.352e-05 0.842 0.399688
## revol.bal 3.354e-06 1.082e-06 3.101 0.001930 **
## revol.util 3.156e-03 1.432e-03 2.203 0.027569 *
## inq.last.6mths 8.656e-02 1.379e-02 6.275 3.50e-10 ***
## delinq.2yrs -1.172e-01 9.790e-02 -1.197 0.231165
## pub.rec -1.466e+00 7.163e-01 -2.047 0.040648 *
## has.delinq.2yrsTRUE 6.123e-02 1.603e-01 0.382 0.702401
## has.pub.recTRUE 1.909e+00 7.378e-01 2.587 0.009671 **
## log.revol.bal -1.912e-02 3.098e-02 -0.617 0.537161
## has.revol.balTRUE -9.495e-02 2.791e-01 -0.340 0.733661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8424.0 on 9577 degrees of freedom
## Residual deviance: 7836.2 on 9555 degrees of freedom
## AIC: 7882.2
##
## Number of Fisher Scoring iterations: 5
# nicer table for the model summary
xkabledply(logit_nfp_model, title = "Logistic Regression :")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 8.5476 | 1.2977 | 6.5866 | 0.0000 |
| credit.policyTRUE | -0.3289 | 0.0847 | -3.8812 | 0.0001 |
| purposecredit_card | -0.5124 | 0.1099 | -4.6621 | 0.0000 |
| purposedebt_consolidation | -0.3336 | 0.0781 | -4.2689 | 0.0000 |
| purposeeducational | 0.0606 | 0.1522 | 0.3984 | 0.6904 |
| purposehome_improvement | 0.0991 | 0.1266 | 0.7828 | 0.4337 |
| purposemajor_purchase | -0.3369 | 0.1664 | -2.0248 | 0.0429 |
| purposesmall_business | 0.5445 | 0.1173 | 4.6419 | 0.0000 |
| int.rate | 1.2989 | 1.7448 | 0.7445 | 0.4566 |
| installment | 0.0012 | 0.0002 | 6.9552 | 0.0000 |
| log.annual.inc | -0.3943 | 0.0609 | -6.4791 | 0.0000 |
| dti | 0.0008 | 0.0047 | 0.1743 | 0.8616 |
| fico | -0.0089 | 0.0014 | -6.2163 | 0.0000 |
| days.with.cr.line | 0.0000 | 0.0000 | 0.8422 | 0.3997 |
| revol.bal | 0.0000 | 0.0000 | 3.1008 | 0.0019 |
| revol.util | 0.0032 | 0.0014 | 2.2034 | 0.0276 |
| inq.last.6mths | 0.0866 | 0.0138 | 6.2747 | 0.0000 |
| delinq.2yrs | -0.1172 | 0.0979 | -1.1974 | 0.2312 |
| pub.rec | -1.4663 | 0.7163 | -2.0471 | 0.0406 |
| has.delinq.2yrsTRUE | 0.0612 | 0.1603 | 0.3821 | 0.7024 |
| has.pub.recTRUE | 1.9091 | 0.7378 | 2.5874 | 0.0097 |
| log.revol.bal | -0.0191 | 0.0310 | -0.6171 | 0.5372 |
| has.revol.balTRUE | -0.0950 | 0.2791 | -0.3403 | 0.7337 |
Before moving on, let us look at the model object
logit_nfp_model a little deeper. The fitted values can be
found in logit_nfp_model$fitted.values, and the first
fitted value is 0.1292253. This is the predicted probability of being
not.fully.paid for data point #1. Compare this to the value from
predict(logit_nfp_model), which is -1.9078255.
The predict() function gives us the logit value. You can
exponentiate it to get the odds p/q, which is 0.1484027. And finally, we
can recover p from p/q, and indeed it is confirmed to be 0.1292253.
The easier way to get that is simply to use
predict(logit_nfp_model, type = c("response"))[1] =
0.1292253. The predict() function also lets you obtain
model predictions for unseen/untrained data points, which
fitted.values cannot provide.
logit_nfp_p_fitted <- logit_nfp_model$fitted.values[1] # the model's predicted probability p-hat for the first data row (not the actual data point)
logit_nfp_p_fitted
This is stored in the model as the fitted value for the probability
p of the first data point. Since the actual data point is a
0/1 True/False value, it is not easy to directly compare them unless we
use a cutoff value (default as 0.5) to convert the probability
p to 0/1.
Now, for unseen data points, we can use the predict()
function to find the model's predicted values. But be careful about what
you are getting from predict() in classification models: by
default it returns the linear predictor, as shown below. For easy
comparison, we use the same data point from the dataset as an example.
# This gives you the predicted values of the data points inside the model.
predict(logit_nfp_model) # this is from the model; it gives the value of logit(p), i.e. ln(p/q)
We can determine the confidence interval of each coefficient in two slightly different ways:
## CIs using profiled log-likelihood
# confint(logit_nfp_model)
xkabledply(confint(logit_nfp_model), title = "CIs using profiled log-likelihood")
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 6.0106 | 11.0984 |
| credit.policyTRUE | -0.4942 | -0.1620 |
| purposecredit_card | -0.7301 | -0.2990 |
| purposedebt_consolidation | -0.4864 | -0.1800 |
| purposeeducational | -0.2436 | 0.3538 |
| purposehome_improvement | -0.1523 | 0.3441 |
| purposemajor_purchase | -0.6731 | -0.0196 |
| purposesmall_business | 0.3133 | 0.7733 |
| int.rate | -2.1291 | 4.7118 |
| installment | 0.0009 | 0.0016 |
| log.annual.inc | -0.5139 | -0.2753 |
| dti | -0.0084 | 0.0101 |
| fico | -0.0117 | -0.0061 |
| days.with.cr.line | 0.0000 | 0.0000 |
| revol.bal | 0.0000 | 0.0000 |
| revol.util | 0.0003 | 0.0060 |
| inq.last.6mths | 0.0597 | 0.1139 |
| delinq.2yrs | -0.3257 | 0.0595 |
| pub.rec | -3.2789 | -0.3499 |
| has.delinq.2yrsTRUE | -0.2447 | 0.3849 |
| has.pub.recTRUE | 0.7324 | 3.7485 |
| log.revol.bal | -0.0794 | 0.0421 |
| has.revol.balTRUE | -0.6438 | 0.4507 |
## CIs using standard errors
# confint.default(logit_nfp_model)
xkabledply(confint.default(logit_nfp_model), title = "CIs using standard errors")
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | 6.0041 | 11.0910 |
| credit.policyTRUE | -0.4949 | -0.1628 |
| purposecredit_card | -0.7278 | -0.2970 |
| purposedebt_consolidation | -0.4867 | -0.1804 |
| purposeeducational | -0.2377 | 0.3589 |
| purposehome_improvement | -0.1490 | 0.3471 |
| purposemajor_purchase | -0.6630 | -0.0108 |
| purposesmall_business | 0.3146 | 0.7744 |
| int.rate | -2.1208 | 4.7187 |
| installment | 0.0009 | 0.0016 |
| log.annual.inc | -0.5136 | -0.2750 |
| dti | -0.0084 | 0.0101 |
| fico | -0.0117 | -0.0061 |
| days.with.cr.line | 0.0000 | 0.0000 |
| revol.bal | 0.0000 | 0.0000 |
| revol.util | 0.0003 | 0.0060 |
| inq.last.6mths | 0.0595 | 0.1136 |
| delinq.2yrs | -0.3091 | 0.0747 |
| pub.rec | -2.8702 | -0.0624 |
| has.delinq.2yrsTRUE | -0.2529 | 0.3753 |
| has.pub.recTRUE | 0.4629 | 3.3552 |
| log.revol.bal | -0.0798 | 0.0416 |
| has.revol.balTRUE | -0.6419 | 0.4520 |
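Exponentiating a coefficient (and its confidence bounds) converts it from the log-odds scale to an odds ratio, which is easier to interpret. For example, using the reported purposesmall_business estimate and its standard-error CI from the tables above:

```r
# Sketch: converting a logit coefficient and its CI to the odds-ratio scale.
beta <- 0.5445             # purposesmall_business estimate from the summary
ci   <- c(0.3146, 0.7744)  # its 95% CI from the standard-error table
round(exp(beta), 2)        # ~1.72: roughly 72% higher odds of not fully paid
round(exp(ci), 2)          # odds-ratio CI, roughly (1.37, 2.17)
```

Since the whole interval sits above 1, small-business loans are associated with higher odds of not being fully paid, all else equal.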
xkabledply(confusion_matrix(logit_nfp_model), title = "Confusion matrix from Logit Model" )
| Predicted FALSE | Predicted TRUE | Total | |
|---|---|---|---|
| Actual FALSE | 7997 | 48 | 8045 |
| Actual TRUE | 1480 | 53 | 1533 |
| Total | 9477 | 101 | 9578 |
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the classification threshold is varied. The area under the curve (AUC) lies between 0.5 and 1; values above 0.8 are generally considered a good model fit.
# receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
loans_logit_nfp$prediction <- predict(logit_nfp_model, type = "response" )
roc_logit_nfp <- roc(not.fully.paid ~ prediction, data=loans_logit_nfp)
auc(roc_logit_nfp) # area-under-curve prefer 0.8 or higher.
# visualize the curve
plot(roc_logit_nfp)
We have an area under the curve of 0.6858, which is below the 0.8 threshold; by the AUC criterion this model is not a good fit, which agrees with the Hosmer and Lemeshow test below.
McFadden is another evaluation tool we can use on logit regressions. This is part of what is called pseudo-R-squared values for evaluation tests. We can calculate the value directly from its definition if we so choose to.
logit_nfp_null <- glm(not.fully.paid ~ 1, data = loans_logit_nfp, family = "binomial")
mcFadden = 1 - logLik(logit_nfp_model)/logLik(logit_nfp_null)
mcFadden
## 'log Lik.' 0.06978116 (df=23)
Or we can use other libraries. The pscl (Political
Science Computational Lab) library has the function pR2()
(pseudo-\(R^2\)) which will do the
trick.
logit_nfp_pr2 = pR2(logit_nfp_model)
## fitting null model for pseudo-r2
logit_nfp_pr2
## llh llhNull G2 McFadden r2ML
## -3.918101e+03 -4.212020e+03 5.878393e+02 6.978116e-02 5.952848e-02
## r2CU
## 1.017550e-01
The McFadden pseudo-\(R^2\) of 0.0698 is loosely analogous to the coefficient of determination \(R^2\): only about 7% of the null-model deviance is accounted for by the explanatory variables, indicating a poor fit.
The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.
logit_nfp_hoslem = hoslem.test(loans_logit_nfp$not.fully.paid, fitted(logit_nfp_model)) # Hosmer and Lemeshow test, a chi-squared test
logit_nfp_hoslem
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: loans_logit_nfp$not.fully.paid, fitted(logit_nfp_model)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.
The p-value is essentially zero, so we reject the null hypothesis of good fit: the Hosmer-Lemeshow test indicates the model is not a good fit, even though several coefficients are highly significant. (As with the earlier model, the reported X-squared equals the sample size exactly, so the test inputs should be double-checked.)
Since we know that interest rate is higher for riskier loans we wanted to test that by building a model to predict interest rate.
set.seed(7)
loans_mlr_int <- loans
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
mlr_int_training <- train(int.rate~., data=loans_mlr_int, method="lm", preProcess="scale", trControl=control)
# estimate variable importance
mlr_int_importance <- varImp(mlr_int_training, scale=FALSE)
# summarize importance
print(mlr_int_importance)
# plot importance
plot(mlr_int_importance)
mlr_int_model <- lm(int.rate ~ fico+installment+purpose+dti+inq.last.6mths+credit.policy+log.annual.inc, data = loans_mlr_int)
summary(mlr_int_model)
mlr_int_model$coefficients
mlr_int_model$call
coef(mlr_int_model)
## (Intercept) fico installment
## 0.4819228196 -0.0005143307 0.0000428540
## purposecredit_card purposedebt_consolidation purposeeducational
## -0.0033260863 -0.0012546309 0.0001820383
## purposehome_improvement purposemajor_purchase purposesmall_business
## 0.0019883173 0.0014739586 0.0157732687
## dti inq.last.6mths credit.policyTRUE
## 0.0001682692 0.0005470733 -0.0020864159
## log.annual.inc
## -0.0008158434
confint(mlr_int_model)
## 2.5 % 97.5 %
## (Intercept) 4.731186e-01 4.907271e-01
## fico -5.236009e-04 -5.050605e-04
## installment 4.107866e-05 4.462934e-05
## purposecredit_card -4.423902e-03 -2.228270e-03
## purposedebt_consolidation -2.099536e-03 -4.097257e-04
## purposeeducational -1.609548e-03 1.973625e-03
## purposehome_improvement 5.855629e-04 3.391072e-03
## purposemajor_purchase -1.358131e-04 3.083730e-03
## purposesmall_business 1.434543e-02 1.720111e-02
## dti 1.199051e-04 2.166333e-04
## inq.last.6mths 3.765454e-04 7.176012e-04
## credit.policyTRUE -3.076261e-03 -1.096571e-03
## log.annual.inc -1.402356e-03 -2.293305e-04
mlr_int_model_tidy <- tidy(mlr_int_model)
mlr_int_model_tidy
mlr_int_model_summary <- augment(mlr_int_model)
str(mlr_int_model_summary)
head(mlr_int_model_summary)
par(mfrow=c(2,2))
plot(mlr_int_model)
vif(mlr_int_model)
lmtest::bptest(mlr_int_model)
ncvTest(mlr_int_model)
distBCMod <- BoxCoxTrans(loans_mlr_int$int.rate)
print(distBCMod)
loans_mlr_int <- cbind(loans_mlr_int, dist_new=predict(distBCMod, loans_mlr_int$int.rate))
head(loans_mlr_int)
mlr_int_model_2 <- lm(dist_new ~ fico+installment+purpose+revol.util+log.revol.bal+credit.policy+log.annual.inc, data=loans_mlr_int)
summary(mlr_int_model_2)
lmtest::bptest(mlr_int_model_2)
par(mfrow=c(2,2))
plot(mlr_int_model_2)
Note that the first linear model, built from the highest-importance features, shows no multicollinearity but substantial heteroscedasticity. To address this we performed a Box-Cox transformation, which reduced the heteroscedasticity but did not eliminate it. We therefore used the sandwich library to compute heteroscedasticity-consistent standard errors, which gives valid inference despite the remaining heteroscedasticity. The resulting model explains about 67% of the variance in interest rate.
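The robust standard errors from the sandwich library can be obtained as in the sketch below, on simulated stand-in data whose noise grows with x to mimic heteroscedasticity. The choice of the HC1 estimator is ours for illustration, not something fixed by the analysis:

```r
# Sketch: robust (heteroscedasticity-consistent) standard errors via sandwich.
library(sandwich)
library(lmtest)

set.seed(7)
x <- runif(500, 1, 10)
y <- 2 + 0.5 * x + rnorm(500, sd = 0.3 * x)  # noise grows with x
fit <- lm(y ~ x)

coeftest(fit)                                    # classical standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # robust standard errors
```

The coefficient estimates are unchanged; only the standard errors (and hence the t-statistics and p-values) differ between the two calls.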
We decided to try LASSO regression as well to see if we could get a better model for interest rate with this method.
loans_lasso <- loans
# glmnet can't handle factor variables, so they have to be converted to dummy (0/1) variables
loans_lasso$credit.policy.T <- ifelse(loans_lasso$credit.policy == TRUE, 1, 0)
loans_lasso$credit.policy.F <- ifelse(loans_lasso$credit.policy == FALSE, 1, 0)
loans_lasso$has.delinq.2yrs.T <- ifelse(loans_lasso$has.delinq.2yrs == TRUE, 1, 0)
loans_lasso$has.delinq.2yrs.F <- ifelse(loans_lasso$has.delinq.2yrs == FALSE, 1, 0)
loans_lasso$has.revol.bal.T <- ifelse(loans_lasso$has.revol.bal == TRUE, 1, 0)
loans_lasso$has.revol.bal.F <- ifelse(loans_lasso$has.revol.bal == FALSE, 1, 0)
loans_lasso$has.pub.rec.T <- ifelse(loans_lasso$has.pub.rec == TRUE, 1, 0)
loans_lasso$has.pub.rec.F <- ifelse(loans_lasso$has.pub.rec == FALSE, 1, 0)
loans_lasso$not.fully.paid.T<- ifelse(loans_lasso$not.fully.paid == TRUE, 1, 0)
loans_lasso$not.fully.paid.F<- ifelse(loans_lasso$not.fully.paid == FALSE, 1, 0)
# The below dataframe shows the structure of the data, in which the categorical variables are converted to dummy variables.
str(loans_lasso)
## tibble [9,578 × 28] (S3: tbl_df/tbl/data.frame)
## $ credit.policy : logi [1:9578] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num [1:9578] 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num [1:9578] 829 228 367 162 103 ...
## $ log.annual.inc : num [1:9578] 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num [1:9578] 19.5 14.3 11.6 8.1 15 ...
## $ fico : num [1:9578] 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num [1:9578] 5640 2760 4710 2700 4066 ...
## $ revol.bal : num [1:9578] 28854 33623 3511 33667 4740 ...
## $ revol.util : num [1:9578] 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : num [1:9578] 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : num [1:9578] 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : num [1:9578] 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : logi [1:9578] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ has.delinq.2yrs : logi [1:9578] FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ has.pub.rec : logi [1:9578] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ log.revol.bal : num [1:9578] 10.27 10.42 8.16 10.42 8.46 ...
## $ has.revol.bal : logi [1:9578] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ credit.policy.T : num [1:9578] 1 1 1 1 1 1 1 1 1 1 ...
## $ credit.policy.F : num [1:9578] 0 0 0 0 0 0 0 0 0 0 ...
## $ has.delinq.2yrs.T: num [1:9578] 0 0 0 0 1 0 0 0 0 0 ...
## $ has.delinq.2yrs.F: num [1:9578] 1 1 1 1 0 1 1 1 1 1 ...
## $ has.revol.bal.T : num [1:9578] 1 1 1 1 1 1 1 1 1 1 ...
## $ has.revol.bal.F : num [1:9578] 0 0 0 0 0 0 0 0 0 0 ...
## $ has.pub.rec.T : num [1:9578] 0 0 0 0 0 0 1 0 0 0 ...
## $ has.pub.rec.F : num [1:9578] 1 1 1 1 1 1 0 1 1 1 ...
## $ not.fully.paid.T : num [1:9578] 0 0 0 0 0 0 1 1 0 0 ...
## $ not.fully.paid.F : num [1:9578] 1 1 1 1 1 1 0 0 1 1 ...
We then dropped the original categorical variables, keeping only their dummy versions for training and testing. The predictor data frame excludes int.rate, which is the response, along with log.revol.bal and revol.util.
# Extracting the values of Y
lasso_y <- loans_lasso$int.rate
# Extracting the values of X
lasso_x <- subset(loans_lasso, select = -c(purpose,int.rate,credit.policy,not.fully.paid,has.delinq.2yrs,has.pub.rec,has.revol.bal,log.revol.bal,revol.util))
str(lasso_x)
## tibble [9,578 × 19] (S3: tbl_df/tbl/data.frame)
## $ installment : num [1:9578] 829 228 367 162 103 ...
## $ log.annual.inc : num [1:9578] 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num [1:9578] 19.5 14.3 11.6 8.1 15 ...
## $ fico : num [1:9578] 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num [1:9578] 5640 2760 4710 2700 4066 ...
## $ revol.bal : num [1:9578] 28854 33623 3511 33667 4740 ...
## $ inq.last.6mths : num [1:9578] 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : num [1:9578] 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : num [1:9578] 0 0 0 0 0 0 1 0 0 0 ...
## $ credit.policy.T : num [1:9578] 1 1 1 1 1 1 1 1 1 1 ...
## $ credit.policy.F : num [1:9578] 0 0 0 0 0 0 0 0 0 0 ...
## $ has.delinq.2yrs.T: num [1:9578] 0 0 0 0 1 0 0 0 0 0 ...
## $ has.delinq.2yrs.F: num [1:9578] 1 1 1 1 0 1 1 1 1 1 ...
## $ has.revol.bal.T : num [1:9578] 1 1 1 1 1 1 1 1 1 1 ...
## $ has.revol.bal.F : num [1:9578] 0 0 0 0 0 0 0 0 0 0 ...
## $ has.pub.rec.T : num [1:9578] 0 0 0 0 0 0 1 0 0 0 ...
## $ has.pub.rec.F : num [1:9578] 1 1 1 1 1 1 0 1 1 1 ...
## $ not.fully.paid.T : num [1:9578] 0 0 0 0 0 0 1 1 0 0 ...
## $ not.fully.paid.F : num [1:9578] 1 1 1 1 1 1 0 0 1 1 ...
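Note that the purpose factor is simply dropped from the predictors here. Had we wanted to keep it, one common alternative is to expand the factor into one dummy column per level with `model.matrix()`; the sketch below uses hypothetical data rather than the loans dataset:

```r
# Hypothetical sketch: expand a factor into per-level dummy columns.
# The "- 1" removes the intercept so every level gets its own column.
toy <- data.frame(purpose = factor(c("credit_card", "all_other", "credit_card")))
purpose_dummies <- model.matrix(~ purpose - 1, data = toy)
colnames(purpose_dummies)  # one column per factor level
```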
The regressor and response data were split 70%/30% into training and testing sets.
# Partition and create the index matrix of selected values
lasso_index <- createDataPartition(loans_lasso$int.rate, p=0.7, list=FALSE, times=1)
# Creating Test and Training Data in the ratio of 70% and 30%
lasso_train_x <- as.matrix(lasso_x[lasso_index,])
lasso_test_x <- as.matrix(lasso_x[-lasso_index,])
lasso_train_y <- as.matrix(lasso_y[lasso_index])
lasso_test_y <- as.matrix(lasso_y[-lasso_index])
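For reproducibility, the partition above could also be seeded. The base-R sketch below shows the same 70/30 idea with `sample()` on illustrative data; note that `createDataPartition()` from caret, used above, additionally stratifies the split on the response:

```r
# Base-R sketch of a reproducible 70/30 split (illustrative seed and size,
# not taken from the report)
set.seed(42)
n <- 100
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
length(train_idx)  # 70 training rows; the remaining 30 form the test set
```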
To run the LASSO regression we used the glmnet library and its cross-validation solver, cv.glmnet().
# Fitting Lasso
lasso_alpha1_fit <- cv.glmnet(lasso_train_x, lasso_train_y, alpha=1, type.measure = 'mse', family='gaussian')
The cv.glmnet() function performs 10-fold cross-validation on our training data and picks the best lambda value, the one that minimizes the mean squared cross-validation error. In our case, the best lambda is 4.500006e-05.
# Best Lambda
lasso_best_lambda <- lasso_alpha1_fit$lambda.min
lasso_best_lambda
## [1] 4.500006e-05
The output below shows the shrunken coefficients obtained from the model. The coefficients of unimportant predictors are driven to zero; a dot next to a coefficient name indicates that the variable has been shrunk to zero.
# Coefficients
coef(lasso_alpha1_fit)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.464100e-01
## installment 3.727407e-05
## log.annual.inc .
## dti .
## fico -4.718489e-04
## days.with.cr.line .
## revol.bal .
## inq.last.6mths 2.789728e-04
## delinq.2yrs .
## pub.rec .
## credit.policy.T -8.322713e-04
## credit.policy.F 4.885155e-16
## has.delinq.2yrs.T .
## has.delinq.2yrs.F .
## has.revol.bal.T .
## has.revol.bal.F .
## has.pub.rec.T .
## has.pub.rec.F .
## not.fully.paid.T .
## not.fully.paid.F .
The fitted model is then used to predict the interest rate on the reserved 30% test set, and the mean squared error (MSE) and R-squared value are computed below.
# Using the model to predict the interest rate using the test data
lasso_y_predicted <- predict(lasso_alpha1_fit, s = lasso_best_lambda, newx = lasso_test_x)
The MSE:
# Calculating the mean squared errors
lasso_mse <- mean((lasso_y_predicted - lasso_test_y)^2)
lasso_mse
## [1] 0.0002514385
The R-squared value:
# R squared
# Calculating the total sum of squares(sst)
lasso_sst <- sum((lasso_test_y - mean(lasso_test_y))^2)
# Calculating the sum of squared errors (sse)
lasso_sse <- sum((lasso_y_predicted - lasso_test_y)^2)
lasso_rsq <- 1-(lasso_sse/lasso_sst)
lasso_rsq
## [1] 0.6470912
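The MSE and R-squared calculations above can be bundled into a small helper. The sketch below applies the same formulas to illustrative vectors rather than the model output:

```r
# Self-contained sketch of the MSE and R-squared formulas used above
rsq_mse <- function(actual, predicted) {
  mse <- mean((predicted - actual)^2)            # mean squared error
  sst <- sum((actual - mean(actual))^2)          # total sum of squares
  sse <- sum((predicted - actual)^2)             # sum of squared errors
  c(mse = mse, rsq = 1 - sse / sst)
}
# Illustrative values, not the test-set predictions
rsq_mse(actual = c(0.10, 0.12, 0.14), predicted = c(0.11, 0.12, 0.13))
```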
The variable importance plot below shows which predictors play the largest role in predicting the interest rate.
# Plotting the variable importance graph
vip(lasso_alpha1_fit)
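For a LASSO fit, vip() ranks predictors by the absolute magnitude of their coefficients. The base-R sketch below illustrates the same idea using the nonzero coefficient values reported above:

```r
# Nonzero lasso coefficients from the output above (intercept excluded)
coefs <- c(installment     =  3.727407e-05,
           fico            = -4.718489e-04,
           inq.last.6mths  =  2.789728e-04,
           credit.policy.T = -8.322713e-04)
# Importance ranking: largest absolute coefficient first
sort(abs(coefs), decreasing = TRUE)
```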
In our EDA we compared loans that met the credit underwriting criteria against whether the borrower fully paid. Using the credit.policy and not.fully.paid variables, we calculated the percentage of borrowers who did not fully pay, split by whether they met the credit underwriting criteria.
| Meets Credit Policy | Percent Not Fully Paid |
|---|---|
| FALSE | 27.8% |
| TRUE | 13.2% |
From this we saw that about 13.2% of borrowers who met the credit underwriting criteria did not fully pay, while for the borrowers who did not meet the credit underwriting criteria about 27.8% did not fully pay.
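Since the mean of a logical vector is the share of TRUE values, a percentage table like the one above can be produced with a one-line group mean. The sketch below uses a toy data frame rather than the full loans data:

```r
# Toy data: percent not fully paid within each credit.policy group
toy <- data.frame(credit.policy  = c(TRUE, TRUE, FALSE, FALSE),
                  not.fully.paid = c(FALSE, TRUE, TRUE, TRUE))
tapply(toy$not.fully.paid, toy$credit.policy, mean) * 100
```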
Since our models for credit.policy worked well, we
decided to try the same thing using the predicted
credit.policy from these models. Let’s start with the
decision tree model:
| Meets Predicted Credit Policy | Percent Not Fully Paid |
|---|---|
| 1 (does not meet) | 28.8% |
| 2 (meets) | 13.6% |
Now let’s try the logistic regression model:
| Meets Predicted Credit Policy | Percent Not Fully Paid |
|---|---|
| 0 | 29.8% |
| 1 | 13.8% |
We can see that the results from our model predictions are about the same as those from LendingClub's actual credit underwriting policy. In other words, if we used our models to decide who should get a loan, they would perform about as well as LendingClub's credit underwriting policy at avoiding loans that end up not fully paid.